-
Free, publicly-accessible full text available September 30, 2026
-
In this paper, we explore the non-asymptotic global convergence rates of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method implemented with exact line search. Notably, due to Dixon's equivalence result, our findings are also applicable to other quasi-Newton methods in the convex Broyden class employing exact line search, such as the Davidon-Fletcher-Powell (DFP) method. Specifically, we focus on problems where the objective function is strongly convex with Lipschitz continuous gradient and Hessian. Our results hold for any initial point and any symmetric positive definite initial Hessian approximation matrix. The analysis unveils a detailed three-phase convergence process, characterized by distinct linear and superlinear rates, contingent on the iteration progress. Additionally, our theoretical findings demonstrate the trade-offs between linear and superlinear convergence rates for BFGS when we modify the initial Hessian approximation matrix, a phenomenon further corroborated by our numerical experiments.
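As a rough illustration of the setting described above, the following is a minimal sketch of BFGS with exact line search, restricted to a strongly convex quadratic f(x) = 0.5 x^T A x - b^T x so that the exact step size has a closed form. The function name `bfgs_exact_line_search`, its parameters, and the quadratic test problem are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def bfgs_exact_line_search(A, b, x0, B0, iters=50, tol=1e-8):
    """Sketch: BFGS on f(x) = 0.5 x^T A x - b^T x (A symmetric positive definite).

    B is the Hessian approximation; B0 may be any symmetric positive definite matrix.
    """
    x = np.asarray(x0, dtype=float).copy()
    B = np.asarray(B0, dtype=float).copy()
    g = A @ x - b                                # gradient of the quadratic
    for _ in range(iters):
        if np.linalg.norm(g) <= tol:
            break
        d = -np.linalg.solve(B, g)               # quasi-Newton direction
        t = -(g @ d) / (d @ A @ d)               # exact minimizer of f(x + t d) over t
        s = t * d                                # iterate displacement
        x_new = x + s
        g_new = A @ x_new - b
        y = g_new - g                            # gradient displacement
        Bs = B @ s
        # Standard BFGS update of the Hessian approximation.
        B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
        x, g = x_new, g_new
    return x
```

Changing `B0` (for example, scaling the identity) is one simple way to experiment with how the initial Hessian approximation affects the early linear phase versus the later superlinear phase.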
-
We study the problem of finding an ε-first-order stationary point (FOSP) of a smooth function, given access only to gradient information. The best-known gradient query complexity for this task, assuming both the gradient and Hessian of the objective function are Lipschitz continuous, is O(ε^{-7/4}). In this work, we propose a method with a gradient complexity of O(d^{1/4} ε^{-13/8}), where d is the problem dimension, leading to an improved complexity when d = O(ε^{-1/2}). To achieve this result, we design an optimization algorithm that, underneath, involves solving two online learning problems. Specifically, we first reformulate the task of finding a stationary point for a nonconvex problem as minimizing the regret in an online convex optimization problem, where the loss is determined by the gradient of the objective function. Then, we introduce a novel optimistic quasi-Newton method to solve this online learning problem, with the Hessian approximation update itself framed as an online learning problem in the space of matrices. Beyond improving the complexity bound for achieving an ε-FOSP using a gradient oracle, our result provides the first guarantee suggesting that quasi-Newton methods can potentially outperform gradient descent-type methods in nonconvex settings.
Free, publicly-accessible full text available June 15, 2026
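The dimension threshold quoted in the abstract above follows from comparing the two complexity bounds directly. A short check, ignoring constants:

```latex
\[
d^{1/4}\,\varepsilon^{-13/8} \le \varepsilon^{-7/4}
\;\Longleftrightarrow\;
d^{1/4} \le \varepsilon^{-7/4 + 13/8} = \varepsilon^{-1/8}
\;\Longleftrightarrow\;
d \le \varepsilon^{-1/2}.
\]
```

So the proposed bound is no worse than the O(ε^{-7/4}) bound exactly when the dimension is at most of order ε^{-1/2}.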
-
Stochastic second-order methods are known to achieve fast local convergence in strongly convex optimization by relying on noisy Hessian estimates to precondition the gradient. Yet, most of these methods achieve superlinear convergence only when the stochastic Hessian noise diminishes, requiring an increase in the per-iteration cost as time progresses. Recent work in \cite{na2022hessian} addressed this issue via a Hessian averaging scheme that achieves a superlinear convergence rate without increasing the per-iteration cost. However, the considered method exhibits a slow global convergence rate, requiring up to Õ(κ^2) iterations to reach the superlinear rate of Õ((1/t)^{t/2}), where κ is the problem's condition number. In this paper, we propose a novel stochastic Newton proximal extragradient method that significantly improves these bounds, achieving a faster global linear rate and reaching the same fast superlinear rate in Õ(κ) iterations. We achieve this by developing a novel extension of the Hybrid Proximal Extragradient (HPE) framework, which simultaneously achieves fast global and local convergence rates for strongly convex functions with access to a noisy Hessian oracle.
Free, publicly-accessible full text available December 10, 2025
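To make the Hessian averaging idea from the prior work concrete, here is a minimal sketch of a Newton-type iteration that preconditions the gradient with a running uniform average of noisy Hessian samples. It illustrates only the averaging step, not the proximal extragradient method proposed in the paper; the function name `averaged_newton` and the oracle arguments `grad` and `noisy_hessian` are hypothetical names chosen for this example.

```python
import numpy as np

def averaged_newton(grad, noisy_hessian, x0, iters=30):
    """Sketch: Newton steps preconditioned by a uniform average of noisy Hessians.

    grad(x) returns the gradient; noisy_hessian(x) returns one noisy Hessian sample.
    Assumes the averaged matrix stays invertible (e.g. small noise, strong convexity).
    """
    x = np.asarray(x0, dtype=float).copy()
    H_avg = np.zeros((x.size, x.size))
    for t in range(1, iters + 1):
        H_t = noisy_hessian(x)                   # one sample per iteration (fixed cost)
        H_avg += (H_t - H_avg) / t               # running mean of H_1, ..., H_t
        x = x - np.linalg.solve(H_avg, grad(x))  # preconditioned gradient step
    return x
```

Because each iteration draws a single Hessian sample, the per-iteration cost stays constant while the averaged estimate becomes less noisy over time.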
-
Free, publicly-accessible full text available December 1, 2025
-
Stochastic second-order methods accelerate local convergence in strongly convex optimization by using noisy Hessian estimates to precondition gradients. However, they typically achieve superlinear convergence only when Hessian noise diminishes, which increases per-iteration costs. Prior work [arXiv:2204.09266] introduced a Hessian averaging scheme that maintains low per-iteration cost while achieving superlinear convergence, but with slow global convergence, requiring Õ(κ^2) iterations to reach the superlinear rate of Õ((1/t)^{t/2}), where κ is the condition number. This paper proposes a stochastic Newton proximal extragradient method that improves these bounds, delivering faster global linear convergence and achieving the same fast superlinear rate in only Õ(κ) iterations. The method extends the Hybrid Proximal Extragradient (HPE) framework, yielding improved global and local convergence guarantees for strongly convex functions with access to a noisy Hessian oracle.
Free, publicly-accessible full text available November 11, 2025
-
Free, publicly-accessible full text available December 10, 2025
-
An increasingly popular machine learning paradigm is to pretrain a neural network (NN) on many tasks offline, then adapt it to downstream tasks, often by re-training only the last linear layer of the network. This approach yields strong downstream performance in a variety of contexts, demonstrating that multitask pretraining leads to effective feature learning. Although several recent theoretical studies have shown that shallow NNs learn meaningful features when either (i) they are trained on a single task or (ii) they are linear, very little is known about the closer-to-practice case of nonlinear NNs trained on multiple tasks. In this work, we present the first results proving that feature learning occurs during training with a nonlinear model on multiple tasks. Our key insight is that multi-task pretraining induces a pseudo-contrastive loss that favors representations that align points that typically have the same label across tasks. Using this observation, we show that when the tasks are binary classification tasks with labels depending on the projection of the data onto an r-dimensional subspace within the d-dimensional input space (d ≫ r), a simple gradient-based multitask learning algorithm on a two-layer ReLU NN recovers this projection, allowing for generalization to downstream tasks with sample and neuron complexity independent of d. In contrast, we show that with high probability over the draw of a single task, training on this single task cannot be guaranteed to learn all r ground-truth features.
-
Statistical and Computational Complexities of BFGS Quasi-Newton Method for Generalized Linear Models
The gradient descent (GD) method has been used widely to solve parameter estimation in generalized linear models (GLMs), a generalization of linear models when the link function can be non-linear. In GLMs with a polynomial link function, it has been shown that in the high signal-to-noise ratio (SNR) regime, due to the problem's strong convexity and smoothness, GD converges linearly and reaches the final desired accuracy in a logarithmic number of iterations. In contrast, in the low SNR setting, where the problem becomes locally convex, GD converges at a slower rate and requires a polynomial number of iterations to reach the desired accuracy. Even though Newton's method can be used to resolve the flat curvature of the loss functions in the low SNR case, its computational cost is prohibitive in high-dimensional settings as it is $$\mathcal{O}(d^3)$$, where $$d$$ is the problem dimension. To address the shortcomings of GD and Newton's method, we propose the use of the BFGS quasi-Newton method to solve parameter estimation of the GLMs, which has a per-iteration cost of $$\mathcal{O}(d^2)$$. When the SNR is low, for GLMs with a polynomial link function of degree $$p$$, we demonstrate that the iterates of BFGS converge linearly to the optimal solution of the population least-square loss function, and the contraction coefficient of the BFGS algorithm is comparable to that of Newton's method. Moreover, the contraction factor of the linear rate is independent of problem parameters and only depends on the degree of the link function $$p$$. Also, for the empirical loss with $$n$$ samples, we prove that in the low SNR setting of GLMs with a polynomial link function of degree $$p$$, the iterates of BFGS reach a final statistical radius of $$\mathcal{O}((d/n)^{\frac{1}{2p+2}})$$ after at most $$\log(n/d)$$ iterations. This complexity is significantly less than the number required for GD, which scales polynomially with $$(n/d)$$.
-
Krylov Cubic Regularized Newton: A Subspace Second-Order Method with Dimension-Free Convergence Rate
Second-order optimization methods, such as cubic regularized Newton methods, are known for their rapid convergence rates; nevertheless, they become impractical in high-dimensional problems due to their substantial memory requirements and computational costs. One promising approach is to execute second-order updates within a lower-dimensional subspace, giving rise to subspace second-order methods. However, the majority of existing subspace second-order methods randomly select subspaces, consequently resulting in slower convergence rates depending on the problem's dimension $$d$$. In this paper, we introduce a novel subspace cubic regularized Newton method that achieves a dimension-independent global convergence rate of $$\mathcal{O}\left(\frac{1}{mk}+\frac{1}{k^2}\right)$$ for solving convex optimization problems. Here, $$m$$ represents the subspace dimension, which can be significantly smaller than $$d$$. Instead of adopting a random subspace, our primary innovation involves performing the cubic regularized Newton update within the Krylov subspace associated with the Hessian and the gradient of the objective function. This result marks the first instance of a dimension-independent convergence rate for a subspace second-order method. Furthermore, when specific spectral conditions of the Hessian are met, our method recovers the convergence rate of a full-dimensional cubic regularized Newton method. Numerical experiments show our method converges faster than existing random subspace methods, especially for high-dimensional problems.
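The Krylov subspace update described above can be sketched as follows. This is an illustrative reading of the abstract under stated assumptions, not the authors' implementation: the basis is built by plain Gram-Schmidt, the cubic subproblem is solved by bisection on the step norm (valid when the projected Hessian is positive semidefinite), and the names `krylov_basis`, `cubic_step`, `krylov_cubic_newton`, and the regularization constant `M` are hypothetical.

```python
import numpy as np

def krylov_basis(H, g, m):
    """Orthonormal basis of span{g, Hg, ..., H^(m-1) g} via Gram-Schmidt."""
    V = []
    v = np.asarray(g, dtype=float)
    for _ in range(m):
        for u in V:                        # orthogonalize against earlier vectors
            v = v - (u @ v) * u
        nrm = np.linalg.norm(v)
        if nrm < 1e-12:                    # subspace is already invariant
            break
        v = v / nrm
        V.append(v)
        v = H @ v
    return np.stack(V, axis=1)

def cubic_step(H_s, g_s, M):
    """Minimize g^T z + 0.5 z^T H z + (M/6)||z||^3 for PSD H.

    Uses the stationarity condition z = -(H + (M/2) r I)^{-1} g with r = ||z||
    and finds r by bisection.
    """
    k = g_s.size

    def z_of(r):
        return -np.linalg.solve(H_s + 0.5 * M * r * np.eye(k), g_s)

    lo, hi = 0.0, 1.0
    while np.linalg.norm(z_of(hi)) > hi:   # grow the bracket until ||z(hi)|| <= hi
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(z_of(mid)) > mid:
            lo = mid
        else:
            hi = mid
    return z_of(hi)

def krylov_cubic_newton(grad, hess, x0, m, M, iters=100):
    """Cubic regularized Newton steps restricted to an m-dimensional Krylov subspace."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) < 1e-10:
            break
        V = krylov_basis(H, g, m)
        z = cubic_step(V.T @ H @ V, V.T @ g, M)   # solve the small projected problem
        x = x + V @ z
    return x
```

Only the projected matrix of size at most m by m is factorized in each step, which is where the savings over a full-dimensional cubic regularized Newton step would come from; this sketch still forms the full Hessian for simplicity, whereas a practical version would likely use Hessian-vector products.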